The receptive field of a Convolutional Neural Network (CNN) refers to the specific region of the input data that influences the activation of a particular neuron in the network. In other words, it is the area of the input that contributes to that neuron's output computation (Li, Iosifidis, and Zhang 2022). The size of this region determines the level of feature abstraction: larger receptive fields can capture more global patterns, while smaller ones focus on local details.
The effective receptive field (the region that actually has significant influence on the output) can differ from the theoretical receptive field (calculated mathematically) and typically shows a Gaussian-like distribution of influence, with pixels near the center having more impact than those at the edges (Luo et al. 2017).
Different tasks may require different receptive field configurations. For example, tasks involving fine details might require networks with smaller receptive fields, while tasks requiring more global context could benefit from architectures with larger receptive fields. However, it is important to note that the effective receptive field is a property of the network’s architecture and not directly dependent on the input image size itself.
In this study, we experiment with configurations of convolutional filters and network architectures that affect the effective receptive field, using different convolutional filter sizes, and compare the resulting parameter counts alongside the compute cost.
Introduction
The history of Convolutional Neural Networks (CNNs) is a rich one spanning multiple decades. The CNN architecture is heavily inspired by the biology of the visual cortex, which contains arrangements of simple and complex cells that activate based on specific subregions of the visual field, known as receptive fields (Indolia et al. 2018). CNNs have generated much of the buzz, as well as many of the advancements, in deep learning in recent years, particularly in computer vision and image analysis. Their applications range from industrial settings such as fault detection, to medical imaging for disease detection through classification (Yamashita et al. 2018), to consumer scenarios such as face detection in digital cameras and object detection in driving systems. Moreover, Gu et al. point out that the rapid growth of annotated data and the significant improvements in the processing power of graphics processing units (GPUs) have empowered CNN research even further, leading to many state-of-the-art models and their results on various tasks (Gu et al. 2018).
There are numerous CNN models and architectures. A review article by Cong and Zhou goes into detail about six typical CNN architectures. A typical CNN consists of five basic layers: the input layer, the convolution layer, the pooling layer, the Fully Connected (FC) layer (also known as the Dense layer), and the output layer. Early advancements in CNN design are exemplified by AlexNet, which won the ImageNet competition and introduced the ReLU activation function in place of tanh to reduce computational complexity and mitigate the vanishing gradient problem. Moreover, the use of Local Response Normalization (LRN) and pooling (down-sampling) helped improve performance and reduce overfitting. The review also mentions ZFNet (DeconvNet), where researchers provided visualizations to understand the internal mechanism of a CNN architecture.
The authors then discuss VGGNet, where researchers used smaller 3x3 convolution filters and deeper architectures with increased network depth. They also mention GoogLeNet and its use of multiple filters of varying sizes stacked together to extract features at different scales. The authors point out that Batch Normalization, introduced with the InceptionV2 model, was a major contribution: by stabilizing gradient flow and addressing internal covariate shift, it improved backpropagation and alleviated the vanishing gradient problem.
Then there is the idea of cross-layer connection, which differs from prior architectures where connections existed only between adjacent layers. Residual Networks (ResNet) implemented cross-layer connections through shortcut connections, where the input is carried across multiple layers, enabling feature aggregation across non-adjacent layers. The review frames this as widening the network, in contrast to deepening it. Finally, the authors describe DenseNet, a CNN architecture with dense connections in which each layer is connected to every subsequent layer. A downside is that the model may relearn the same features repeatedly, and the dense connectivity increases computational overhead.
Lightweight architectures, such as MobileNetV3, target mobile or embedded devices. Lightweight CNNs are smaller CNN architectures obtained through compression (reducing model size) and acceleration (increasing inference speed). MobileNetV3 is a lightweight CNN that decomposes the standard convolution into depthwise and pointwise convolutions, decreasing both model size and compute burden.
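To illustrate the savings, the following sketch with hypothetical channel counts compares a standard 3×3 convolution against its depthwise-plus-pointwise factorization in PyTorch:

Code
import torch.nn as nn

# Standard 3x3 convolution vs. its depthwise separable factorization
# (the 64 -> 128 channel counts are hypothetical, for illustration only)
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)
depthwise = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)  # one 3x3 filter per channel
pointwise = nn.Conv2d(64, 128, kernel_size=1)                       # 1x1 conv mixes channels

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))                       # 73,856 parameters
print(count(depthwise) + count(pointwise))   # 8,960 parameters (~8x fewer)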
The fifth type of CNN architecture deals with object detection. The article introduces the Region-based CNN (R-CNN), which achieves object detection through three main ideas: region proposals, feature extraction, and region classification. Inspired by this and based on GoogLeNet, the YOLO (You Only Look Once) model divides an input image into a grid, predicts multiple bounding boxes, and classifies the corresponding regions to detect objects.
Finally, the sixth and last architecture is the Transformer Encoder, where the article looks into Vision Transformers (ViTs). Since transformers lack spatial inductive biases, which leads to poor optimization, researchers have combined CNNs and ViTs to overcome this limitation (Cong and Zhou 2023). Collectively, these advancements underscore the dynamic interplay between architectural innovation and optimization in CNN evolution.
CNNs are inherently compute-heavy because they contain many parameters and structural redundancy in the deeper architectures used for feature extraction. Recent advancements in lightweight deep convolutional neural networks (DCNNs) have primarily focused on optimizing network architecture to balance accuracy and computational efficiency. One of the primary strategies for achieving this balance is depthwise separable dilated convolution, as used in the Dual Residual and Large Receptive Field Network (DRLN) of Pan et al. (2024). Dilated convolutions introduce holes or spaces between kernel elements, effectively increasing the receptive field without increasing the number of parameters. ShuffleNet further enhances this approach by incorporating group convolutions and channel shuffling to improve information flow while maintaining efficiency (Zhang et al. 2017).
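As a concrete illustration (hypothetical shapes, not DRLN's actual configuration), a 3×3 kernel with dilation 2 spans a 5×5 region while keeping the same nine weights:

Code
import torch
import torch.nn as nn

# A 3x3 kernel with dilation=2 spans a 5x5 region using the same 9 weights;
# padding=2 keeps the output the same size as the input.
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)
print(sum(p.numel() for p in conv.parameters()))   # 10 (9 weights + 1 bias)
print(conv(torch.randn(1, 1, 32, 32)).shape)       # torch.Size([1, 1, 32, 32])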
Other architectural innovations include EfficientNet’s compound scaling method, which uniformly scales depth, width, and resolution to optimize efficiency (Tan and Le 2020). Furthermore, NAS (Neural Architecture Search)-designed models have demonstrated promising results in automating the design of lightweight networks, with MobileNetV3 integrating NAS techniques to improve upon its predecessors (Cong and Zhou 2023).
Another key direction in lightweight DCNNs is model compression, which reduces the memory footprint and computational burden without significantly sacrificing accuracy. Pruning techniques, such as structured and unstructured pruning, remove redundant parameters by eliminating less important weights or entire channels (Iofinova et al. 2022). Quantization, which reduces the precision of model parameters from floating point to lower-bit representations (e.g., 8-bit integers), is another widely used technique that significantly reduces memory consumption and inference latency (Pan et al. 2024). Two widely used quantization methods are post-training quantization (PTQ) and quantization-aware training (QAT), with PTQ being the more popular since it can restore model performance without requiring the original training pipeline (Chen et al. 2024).
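A minimal sketch of what symmetric 8-bit quantization does to a single weight tensor follows; real PTQ toolchains additionally calibrate activation ranges, so this is illustrative only:

Code
import torch

# Simulated symmetric per-tensor 8-bit quantization of one weight matrix
w = torch.randn(256, 256)
scale = w.abs().max() / 127                          # map max magnitude into int8 range
w_q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
w_deq = w_q.float() * scale                          # dequantize to compare
print((w - w_deq).abs().max())                       # worst-case rounding error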
Knowledge distillation has also gained traction as an effective compression method, wherein a large, well-trained teacher network transfers knowledge to a smaller student network, improving the student’s performance while keeping it computationally lightweight (Y. Wang 2022). Additionally, low-rank decomposition methods decompose weight matrices into smaller, efficient representations, further enhancing compactness.
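The classic soft-target formulation can be sketched as follows; the temperature T and mixing weight alpha are illustrative hyperparameters rather than values from the cited works:

Code
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients by T^2, following the usual convention
    # Hard targets: ordinary cross-entropy against the ground-truth labels
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard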
The receptive field plays a critical role in the performance of CNNs, as it determines how much context a neuron captures. Recent studies (Luo et al. 2017) highlight that the effective receptive field (ERF) is significantly smaller than the theoretical receptive field, limiting the network’s ability to integrate global information efficiently. This phenomenon has implications for lightweight CNNs, where reducing the number of parameters can inadvertently shrink the ERF, potentially affecting recognition accuracy.
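For reference, the theoretical receptive field of a stack of convolutions grows according to the recurrence \(r_l = r_{l-1} + (k_l - 1) \cdot j_{l-1}\), where \(j_{l-1}\) is the product of the strides of all earlier layers; a small helper (a sketch, not from any cited work) makes this concrete:

Code
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, input side first."""
    r, jump = 1, 1                  # receptive field and cumulative stride
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

# e.g. five stride-2 convolutions, as in the encoder variants studied here:
print(receptive_field([(3, 2)] * 5))   # 63
print(receptive_field([(7, 2)] * 5))   # 187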
Several strategies have been proposed to mitigate this issue in lightweight networks. Dilated convolutions, for example, expand the receptive field without increasing parameter count, making them particularly useful for mobile-friendly models. Additionally, skip connections, as seen in ResNet architectures, help maintain feature propagation across layers, addressing the limitations imposed by a reduced ERF.
Benchmarking lightweight CNNs requires comprehensive evaluation across multiple datasets and hardware platforms. Common benchmark datasets such as ImageNet, CIFAR-10, and COCO are widely used to assess classification and detection performance (Pan et al. 2024),(A. N. Wang et al. 2024). However, real-world deployment also necessitates testing on edge devices to measure latency, power consumption, and memory efficiency.
Despite significant advancements, several challenges remain in designing lightweight CNNs. One area of interest is the development of hardware-aware NAS techniques that tailor models to specific device constraints. Additionally, improving the trade-off between accuracy and efficiency through novel loss functions and training paradigms remains an open research problem. Finally, better understanding and optimization of ERF in lightweight models can further enhance their effectiveness in real-world applications.
By addressing these challenges, future research can drive further improvements in lightweight CNNs, making them even more suitable for deployment in resource-constrained environments such as mobile devices, autonomous systems, and IoT applications.
Dataset
For this project we decided to use a dataset based on the Carla (Car Learning to Act) Driving Simulator, an open-source simulator designed for autonomous driving research. Carla simulates driving scenarios that are applicable for training models for tasks like lane detection, object detection, and semantic segmentation.
Annotation: Each image has a corresponding label file in which lane pixels are marked 1 (left lane) or 2 (right lane) and background (non-lane) pixels are marked 0.
Total samples: 6,408 PNG files = 3,075 train + 3,075 train_label + 129 val + 129 val_label images
Dataset Visualization
Example of one of the images and the corresponding label mask.
The label images are grayscale PNGs with three pixel values: 0, 1, and 2.
The following image shows the label mask rendered with visually distinct colors for human viewing. Here we can see the lane lines, which serve as the labels for training.
Exploring summary statistics of the dataset images
Since the images contain all three RGB channels, we compute summary statistics to understand the distribution of each individual channel as well as of the RGB colors themselves.
Code
import numpy as np
from PIL import Image
from collections import Counter
import matplotlib.pyplot as plt

def analyze_image(image_path):
    img = Image.open(image_path)
    img_array = np.array(img)

    # Per-channel summary statistics
    rgb_means = np.mean(img_array, axis=(0, 1))
    rgb_medians = np.median(img_array, axis=(0, 1))
    rgb_mins = np.min(img_array, axis=(0, 1))
    rgb_maxs = np.max(img_array, axis=(0, 1))
    rgb_stds = np.std(img_array, axis=(0, 1))

    # Color distribution
    rgb_hist = [np.histogram(img_array[:, :, i], bins=256, range=(0, 256))[0] for i in range(3)]

    # Dominant colors (top 5)
    pixels = img_array.reshape(-1, 3)
    counts = Counter(map(tuple, pixels))
    dominant_colors = counts.most_common(5)

    # Luminosity - the weights reflect how the human eye perceives brightness in the RGB channels
    luminosity = np.dot(rgb_means, [0.299, 0.587, 0.114])

    print(f"RGB Means: R={rgb_means[0]:.2f}, G={rgb_means[1]:.2f}, B={rgb_means[2]:.2f}")
    print(f"RGB Medians: R={rgb_medians[0]:.2f}, G={rgb_medians[1]:.2f}, B={rgb_medians[2]:.2f}")
    print(f"RGB Mins: R={rgb_mins[0]}, G={rgb_mins[1]}, B={rgb_mins[2]}")
    print(f"RGB Maxs: R={rgb_maxs[0]}, G={rgb_maxs[1]}, B={rgb_maxs[2]}")
    print(f"RGB Standard Deviations: R={rgb_stds[0]:.2f}, G={rgb_stds[1]:.2f}, B={rgb_stds[2]:.2f}")
    print(f"Luminosity: {luminosity:.2f}")
    print("\n\nTop 5 Dominant Colors (RGB, count):")
    for rgb_vals, count in dominant_colors:
        print(f" {rgb_vals[0]:4d}{rgb_vals[1]:4d}{rgb_vals[2]:4d} : {count}")

    # Plot per-channel histograms
    fig, axs = plt.subplots(3, 1, figsize=(10, 10))
    colors = ['red', 'green', 'blue']
    for i, (hist, color) in enumerate(zip(rgb_hist, colors)):
        axs[i].bar(range(256), hist, color=color, alpha=0.7)
        axs[i].set_title(f'{color.capitalize()} Channel Histogram')
    plt.tight_layout()
    plt.show()

# Usage
analyze_image("C:/Users/deka_/Desktop/Testing/Datasets/carla-lane/train/Town04_Clear_Noon_09_09_2020_14_57_22_frame_569.png")
Summary statistics of the RGB channels of the sample.
Code
print(f"RGB Means: R = {round(np.mean(all_R)):>3}, G = {round(np.mean(all_G)):>3}, B = {round(np.mean(all_B)):>3}")print(f"RGB Medians: R = {round(np.median(all_R)):>3}, G = {round(np.median(all_G)):>3}, B = {round(np.median(all_B)):>3}")print(f"RGB Mins: R = {round(np.min(all_R)):>3}, G = {round(np.min(all_G)):>3}, B = {round(np.min(all_B)):>3}")print(f"RGB Maxs: R = {round(np.max(all_R)):>3}, G = {round(np.max(all_G)):>3}, B = {round(np.max(all_B)):>3}")print(f"RGB Std Devs: R = {round(np.std(all_R)):>3}, G = {round(np.std(all_G)):>3}, B = {round(np.std(all_B)):>3}")
RGB Means: R = 159, G = 163, B = 163
RGB Medians: R = 182, G = 183, B = 186
RGB Mins: R = 0, G = 0, B = 0
RGB Maxs: R = 246, G = 251, B = 252
RGB Std Devs: R = 51, G = 47, B = 51
The top 5 RGB colors in the sample and their count.
Code
for rgb_vals, count in all_dominants.most_common(5):
    print(f" {rgb_vals[0]:4d}{rgb_vals[1]:4d}{rgb_vals[2]:4d} : {count}")
The main objective of this project is to compare 3×3, 5×5, and 7×7 convolution filters within two architectures: a basic CNN-based encoder-decoder and the U-Net model (Ronneberger, Fischer, and Brox 2015). We aim to evaluate how filter size affects segmentation accuracy, training and evaluation time, and overall compute efficiency. The machine learning task is semantic segmentation, specifically detecting lanes on a road. Unlike image classification, where a label is assigned to an entire image, semantic segmentation goes a step further by assigning a class label to each pixel in the image.
Standard convolution filters with strides downsample an image to extract hierarchical features. This happens in the encoding phase. To restore spatial context in the decoding phase (or upsampling phase), transposed convolution filters, also known as deconvolution or deconv layers, are used to upsample feature maps for either reconstruction or decoding.
Semantic segmentation requires a model to output the final result in the same dimension as the input image. This means that the spatial information from the input needs to be carried over (or propagated) throughout the network. The final result is a label map or mask where different pixel values represent different class labels.
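A quick shape check with hypothetical channel counts shows how a stride-2 convolution halves the resolution and a matching transposed convolution restores it:

Code
import torch
import torch.nn as nn

# Stride-2 convolution halves the resolution; a matching transposed
# convolution restores it (output_padding resolves the rounding ambiguity).
x = torch.randn(1, 3, 64, 64)
down = nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(32, 3, kernel_size=3, stride=2, padding=1, output_padding=1)
print(down(x).shape)       # torch.Size([1, 32, 32, 32])
print(up(down(x)).shape)   # torch.Size([1, 3, 64, 64])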
The following steps outline the experimental setup:
1. Network Architecture
1.a CNN Encoder-Decoder
Encoder - The encoder comprises five sequential blocks, each containing:
Strided Convolution Layers:
Kernel size: {3×3, 5×5, 7×7} - 3 variants of the model
Stride 2 with padding {1, 2, 3} for the 3 variants
Channel expansion: 3→32→64→128→256→512
Batch normalization (BN) layer after each convolution followed by ReLU activation
The encoder downsamples the spatial dimensions by \(2^5\) (32× downscaling).
Decoder - The decoder mirrors the encoder structure using transposed convolutions:
Upsampling Layers:
Kernel-matched transposed convolutions (stride 2)
Output padding (1) to recover original resolution
Channel reduction: 512→256→128→64→32→3
BN + ReLU combinations in all layers except final output
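The following is a minimal PyTorch sketch of this encoder-decoder; the exact module layout is an assumption (our training code is not reproduced here), but the kernel sizes, strides, paddings, and channel progression follow the specification above.

Code
import torch
import torch.nn as nn

class ConvEncoderDecoder(nn.Module):
    def __init__(self, kernel_size=3, num_classes=3):
        super().__init__()
        pad = kernel_size // 2          # padding 1/2/3 for kernels 3/5/7
        chans = [3, 32, 64, 128, 256, 512]

        # Encoder: five strided conv blocks, each halving H and W (32x total)
        self.encoder = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], kernel_size, stride=2, padding=pad),
                nn.BatchNorm2d(chans[i + 1]),
                nn.ReLU(inplace=True),
            )
            for i in range(5)
        ])

        # Decoder: transposed convs mirror the encoder; output_padding=1
        # recovers the exact input resolution. No BN/ReLU on the final layer.
        blocks = []
        for i in range(5, 0, -1):
            layers = [nn.ConvTranspose2d(chans[i], chans[i - 1], kernel_size,
                                         stride=2, padding=pad, output_padding=1)]
            if i > 1:
                layers += [nn.BatchNorm2d(chans[i - 1]), nn.ReLU(inplace=True)]
            blocks.append(nn.Sequential(*layers))
        self.decoder = nn.Sequential(*blocks)

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Sanity check: output spatial size matches the input
x = torch.randn(1, 3, 128, 256)
assert ConvEncoderDecoder(kernel_size=5)(x).shape == (1, 3, 128, 256)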
1.b U-Net
Encoder (Contracting Path) - The encoder consists of four sequential blocks, each containing:
Double Convolutional Layers:
Each block applies two consecutive convolutional layers:
Kernel size: {3×3, 5×5, 7×7} - 3 variants of the model
Channel expansion: 3→64→128→256→512
Feature Normalization: Batch normalization (BN) after each convolution
Non-linear Activation: ReLU activation after each BN
Downsampling:
Max pooling (2×2, stride 2) after each block halves the spatial resolution
The encoder downsamples the spatial dimensions by a factor of \(2^4\) (16× downscaling).
Bottleneck
Two convolutional layers (with BN and ReLU) at the bottleneck
Channels: 512→1024
Decoder (Expanding Path) - The decoder mirrors the encoder structure and restores the original resolution using:
Upsampling Layers:
Transposed convolutions (kernel size 2×2, stride 2) double the spatial dimensions at each step
Channel reduction: 1024→512→256→128→64
Skip Connections: At each upsampling stage, the upsampled feature map is concatenated with the corresponding encoder feature map (skip connection) for spatial context
Double Convolutional Layers: After concatenation, two convolutional layers (with BN and ReLU) refine the combined features
Output Layer
Reconstruction Layer:
A final 1×1 convolution layer with 64 input channels maps the features to the desired number of output channels (3 in our case)
Produces an output with the same spatial dimensions as the input
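Analogously, a compact sketch of this U-Net variant follows; the module layout is again our assumption, while kernel sizes, channel progression, pooling, skip connections, and the final 1×1 convolution follow the specification above.

Code
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch, k):
    # Two consecutive conv layers, each followed by BN and ReLU
    pad = k // 2
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=pad), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, k, padding=pad), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self, kernel_size=3, num_classes=3):
        super().__init__()
        chans = [64, 128, 256, 512]
        self.pool = nn.MaxPool2d(2)
        # Contracting path: four double-conv blocks, each followed by 2x2 max pooling
        self.downs = nn.ModuleList(
            [double_conv(3, 64, kernel_size)] +
            [double_conv(chans[i], chans[i + 1], kernel_size) for i in range(3)]
        )
        self.bottleneck = double_conv(512, 1024, kernel_size)
        # Expanding path: 2x2 stride-2 transposed convs double the resolution;
        # each is followed by concatenation with the matching encoder feature map
        self.ups = nn.ModuleList(
            [nn.ConvTranspose2d(c * 2, c, kernel_size=2, stride=2) for c in reversed(chans)]
        )
        self.up_convs = nn.ModuleList(
            [double_conv(c * 2, c, kernel_size) for c in reversed(chans)]
        )
        self.head = nn.Conv2d(64, num_classes, kernel_size=1)  # final 1x1 conv

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)          # saved for the skip connection
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, conv, skip in zip(self.ups, self.up_convs, reversed(skips)):
            x = conv(torch.cat([up(x), skip], dim=1))
        return self.head(x)

x = torch.randn(1, 3, 128, 256)
assert UNet(kernel_size=3)(x).shape == (1, 3, 128, 256)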
2. Training Configuration
All three convolution filter variants (3×3, 5×5, and 7×7) of both the encoder-decoder and U-Net models were trained for 10 epochs. We used PyTorch dataloaders to train in batches with a batch size of 25. Given that the training set contains 3,075 samples, there were 123 batches per epoch. For the loss function, we used PyTorch's built-in cross-entropy loss, which internally applies softmax followed by negative log-likelihood (NLL). This is standard for multi-class segmentation tasks and has been used in many state-of-the-art models (Yeung et al. 2022).
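A condensed sketch of this training configuration is shown below; `train_dataset` and the optimizer choice are assumptions (only the batch size, epoch count, and loss are stated above):

Code
import torch
from torch.utils.data import DataLoader

model = UNet(kernel_size=3)                        # or ConvEncoderDecoder(...) from above
loader = DataLoader(train_dataset, batch_size=25, shuffle=True)  # 3075 / 25 = 123 batches
criterion = torch.nn.CrossEntropyLoss()            # applies softmax + NLL internally
optimizer = torch.optim.Adam(model.parameters())   # optimizer choice is an assumption

for epoch in range(10):
    for images, masks in loader:                   # masks: (N, H, W) with class ids 0/1/2
        logits = model(images)                     # (N, 3, H, W)
        loss = criterion(logits, masks.long())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()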
For validation, we again used the cross-entropy loss to measure the discrepancy between the model's predictions and the ground-truth segmentation masks. We also calculated the mean Intersection over Union (mIoU) and the Dice coefficient, both widely used segmentation metrics that measure the overlap between the predicted and ground-truth masks. The Dice coefficient is particularly well suited to segmentation tasks with class imbalance because it gives double weight to positive overlap, i.e., true positives (Kato and Hotta 2024). In our case, the lane markings occupy a small portion of each image.
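A sketch of how these metrics can be computed from hard predictions follows; excluding the background class from the averages is our assumption:

Code
import torch

# Mean IoU and Dice over the lane classes (ids 1 and 2); eps avoids division by zero
def miou_dice(logits, masks, num_classes=3, eps=1e-7):
    preds = logits.argmax(dim=1)                    # (N, H, W) class ids
    ious, dices = [], []
    for c in range(1, num_classes):                 # skip background class 0
        pred_c, mask_c = preds == c, masks == c
        inter = (pred_c & mask_c).sum().float()     # true positives
        union = (pred_c | mask_c).sum().float()
        ious.append((inter + eps) / (union + eps))
        # Dice counts the overlap (true positives) twice
        dices.append((2 * inter + eps) / (pred_c.sum() + mask_c.sum() + eps))
    return torch.stack(ious).mean(), torch.stack(dices).mean()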
Analysis and Results
Owing to their architecture, the U-Net models have significantly more parameters than the Encoder-Decoder models.
Model        Parameters     Difference with UNet
CNN 3x3        3,139,587
UNet 3x3      31,037,763     27,898,176
CNN 5x5        8,713,219
UNet 5x5      81,241,411     72,528,192
CNN 7x7       17,073,667
UNet 7x7     156,546,883    139,473,216
The CNN Encoder-Decoder models are roughly 10% the size of their counterpart U-Net models.
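Parameter counts like those in the table can be reproduced with the standard PyTorch idiom (`model` being one of the sketched networks above):

Code
total_params = sum(p.numel() for p in model.parameters())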
Peak memory refers to the memory actively utilized during tensor operations (calculations), while reserved memory also includes additional memory cached by the allocator for repeated operations. Since the U-Net models are larger, they consume more memory and take longer to train.
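These two quantities correspond to PyTorch's CUDA memory counters:

Code
import torch

peak_allocated = torch.cuda.max_memory_allocated()  # peak memory actively used by tensors
peak_reserved = torch.cuda.max_memory_reserved()    # also includes the allocator's cache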
When comparing CNN-3 and UNet-3, the CNN-3 model completed training in 5 minutes, while the UNet-3 model took 8 minutes.
Interestingly, the gap in memory usage and training time is significantly larger for the 7x7 models. However, the increase in training time from CNN-3 to CNN-7 is fairly small, despite CNN-7 having more than five times as many parameters as CNN-3.
For model performance, we look at the Dice coefficient for both model types.
Similarly, we can also look at the IoU comparison for both model types.
Based on the Dice & IoU comparison charts, it is clear that the CNN-5 & UNet-3 models are the best performers. The UNet-3 performs slightly better than the CNN-5 given that the dice score .885 > .85 and the IoU score of .805 > .755.
Predictions using CNN-5 & UNet-3
Conclusion
This study explored how different convolutional filter sizes (3×3, 5×5, and 7×7) affect the performance and efficiency of CNN-based Encoder-Decoder and U-Net architectures for lane detection through semantic segmentation. Our findings show that while larger filter sizes naturally increase model complexity and parameter count, they do not always yield proportionally better segmentation performance.
Notably, the U-Net models, which are more parameter-heavy and require more memory, generally outperformed the CNN Encoder-Decoder models in segmentation accuracy, particularly the 3×3 variant. The UNet-3 model achieved the highest Dice coefficient and IoU, making it the best performer in pixelwise classification, and it preserved spatial features better, likely owing to its deeper architecture and skip connections.
However, the CNN-5 model emerged as the better middle ground. It is roughly 90% smaller than its U-Net counterpart while still delivering competitive accuracy, making it suitable for environments where computational efficiency is critical.
References
Chen, Fanghui, Shouliang Li, Jiale Han, Fengyuan Ren, and Zhen Yang. 2024. “Review of Lightweight Deep Convolutional Neural Networks.” Archives of Computational Methods in Engineering 31 (4): 1915–37. https://doi.org/10.1007/s11831-023-10032-z.
Cong, Shuang, and Yang Zhou. 2023. “A Review of Convolutional Neural Network Architectures and Their Optimizations.” Artificial Intelligence Review 56 (3): 1905–69. https://doi.org/10.1007/s10462-022-10213-5.
Gu, Jiuxiang, Zhenhua Wang, Jason Kuen, Lianyang Ma, Amir Shahroudy, Bing Shuai, Ting Liu, et al. 2018. “Recent Advances in Convolutional Neural Networks.” Pattern Recognition 77 (May): 354–77. https://doi.org/10.1016/j.patcog.2017.10.013.
Indolia, Sakshi, Anil Kumar Goswami, S. P. Mishra, and Pooja Asopa. 2018. “Conceptual Understanding of Convolutional Neural Network - A Deep Learning Approach.” Procedia Computer Science 132: 679–88. https://doi.org/10.1016/j.procs.2018.05.069.
Iofinova, Eugenia, Alexandra Peste, Mark Kurtz, and Dan Alistarh. 2022. “How Well Do Sparse ImageNet Models Transfer?” In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12256–66. New Orleans, LA, USA: IEEE. https://doi.org/10.1109/CVPR52688.2022.01195.
Kato, Sota, and Kazuhiro Hotta. 2024. “Adaptive t-vMF Dice Loss: An Effective Expansion of Dice Loss for Medical Image Segmentation.” Computers in Biology and Medicine 168 (January): 107695. https://doi.org/10.1016/j.compbiomed.2023.107695.
Li, Nan, Alexandros Iosifidis, and Qi Zhang. 2022. “Collaborative Edge Computing for Distributed CNN Inference Acceleration Using Receptive Field-Based Segmentation.” Computer Networks 214 (September): 109150. https://doi.org/10.1016/j.comnet.2022.109150.
Luo, Wenjie, Yujia Li, Raquel Urtasun, and Richard Zemel. 2017. “Understanding the Effective Receptive Field in Deep Convolutional Neural Networks.” arXiv. https://doi.org/10.48550/arXiv.1701.04128.
Pan, Lulu, Guo Li, Ke Xu, Yanheng Lv, Wenbo Zhang, Lingxiao Li, and Le Lei. 2024. “Dual Residual and Large Receptive Field Network for Lightweight Image Super-Resolution.” Neurocomputing 600 (October): 128158. https://doi.org/10.1016/j.neucom.2024.128158.
Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. 2015. “U-Net: Convolutional Networks for Biomedical Image Segmentation.” arXiv. https://doi.org/10.48550/arXiv.1505.04597.
Wang, Alex N., Christopher Hoang, Yuwen Xiong, Yann LeCun, and Mengye Ren. 2024. “PooDLe: Pooled and Dense Self-Supervised Learning from Naturalistic Videos.” arXiv. https://doi.org/10.48550/arXiv.2408.11208.
Wang, Yan. 2022. “Edge-Enhanced Feature Distillation Network for Efficient Super-Resolution.” In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 776–84. New Orleans, LA, USA: IEEE. https://doi.org/10.1109/CVPRW56347.2022.00093.
Yamashita, Rikiya, Mizuho Nishio, Richard Kinh Gian Do, and Kaori Togashi. 2018. “Convolutional Neural Networks: An Overview and Application in Radiology.” Insights into Imaging 9 (4): 611–29. https://doi.org/10.1007/s13244-018-0639-9.
Yeung, Michael, Evis Sala, Carola-Bibiane Schönlieb, and Leonardo Rundo. 2022. “Unified Focal Loss: Generalising Dice and Cross Entropy-Based Losses to Handle Class Imbalanced Medical Image Segmentation.” Computerized Medical Imaging and Graphics 95 (January): 102026. https://doi.org/10.1016/j.compmedimag.2021.102026.
Zhang, Xiangyu, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2017. “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices.” arXiv. https://doi.org/10.48550/arXiv.1707.01083.